CHAPTER 8 Getting Your Data into the Computer 109
If you impute a date, put the imputed value in a new column rather than overwriting
what was recorded. Keep the original partial date so every imputation stays traceable.
Any date imputation should be consistent with the study protocol, and not bias the
results. Completely missing dates should be left blank, as statistical software
treats blank cells as missing data.
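As a rough illustration of keeping the original value next to the imputed one, here is a minimal Python sketch. The imputation rule (day 15 when only the day is missing, July 1 when the month is also missing), the field names, and the sample records are all assumptions made for the example; use whatever rule your study protocol specifies.

```python
from datetime import date

def impute_partial_date(raw):
    """Return (imputed_date, was_imputed) for 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'.

    Assumed rule for illustration only: a missing day becomes the 15th,
    and a missing month and day become July 1. A blank string stays missing.
    """
    raw = raw.strip()
    if not raw:
        return None, False          # completely missing: leave it blank
    parts = [int(p) for p in raw.split("-")]
    if len(parts) == 3:
        return date(*parts), False  # complete date, nothing imputed
    if len(parts) == 2:
        return date(parts[0], parts[1], 15), True
    return date(parts[0], 7, 1), True

# Hypothetical records; onset_raw is the date exactly as it was recorded.
records = [
    {"id": 1, "onset_raw": "2021-03-09"},
    {"id": 2, "onset_raw": "2021-03"},
    {"id": 3, "onset_raw": "2021"},
    {"id": 4, "onset_raw": ""},
]

for rec in records:
    imputed, flag = impute_partial_date(rec["onset_raw"])
    rec["onset_imputed"] = imputed        # new column holding the imputed date
    rec["onset_was_imputed"] = flag       # flag so imputed rows are easy to find

for rec in records:
    print(rec)
```

The original `onset_raw` value is never modified, so anyone reviewing the data can see which dates were imputed and what they were imputed from.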
Because of the way most statistics programs store dates and times, they can easily
calculate intervals between any two points in time by simple subtraction. It is best
practice to store raw dates and times, and let the computer calculate the intervals
later (rather than calculate them yourself). For example, if you create variables for
date of birth (DOB) and a visit date (VisDt) in Excel, you can calculate an accurate
age at the time of the visit with this formula:
Age = (VisDt - DOB) / 365.25
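The same subtraction works outside Excel. The following Python sketch assumes two made-up dates and uses the standard `datetime` module; the variable names mirror DOB and VisDt but are otherwise hypothetical.

```python
from datetime import date

# Hypothetical values standing in for the DOB and VisDt columns.
dob = date(1980, 6, 15)
vis_dt = date(2024, 3, 1)

# Subtracting two dates gives a timedelta; .days is the interval in whole days.
days_between = (vis_dt - dob).days

# Dividing by 365.25 converts days to years, averaging in leap years.
age_years = days_between / 365.25

print(f"Age at visit: {age_years:.2f} years")
```

Because only the raw dates are stored, you can recompute the interval in any unit later without retyping anything.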
Checking Your Entered Data for Errors
After you’ve entered all your data into the computer, there are a few things you
can do to check for errors:
» Examine the smallest and largest values in numerical data: Have the
software show you the smallest and largest values for each numerical variable.
This check can often catch decimal-point errors (such as a hemoglobin value of
125 g/dL instead of 12.5 g/dL) or transposition errors (for example, a weight of
517 pounds instead of 157 pounds).
» Sort the values of variables: If your program can show you a sorted list of all
the values for a variable, that’s even better — it often shows misclassified
categories as well as numerical outliers.
» Search for blanks and commas: You can have Excel search for blanks
in category values that shouldn’t have blanks, or for commas in numeric
variables. Make sure the “Match entire cell contents” option is deselected in
the Find and Replace dialog box (you may have to click the Options button to
see the check box). This operation can also be done using statistical software.
Be wary if there is a large number of missing values, because this could indicate
a data collection problem.
» Tabulate categorical variables: You can have your statistics program tabulate
each categorical variable (showing you how often each category
occurs in your data). This check usually finds misclassified categories. Note
that blanks and special characters in character variables may cause incorrect
results when querying, which is why it is important to do this check.
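The checks in this list can be sketched in a few lines of Python using only the standard library. The data set, the column names (`sex`, `weight_lb`), and the helper `to_number` are hypothetical, invented to plant one example of each kind of error; your statistical software will have its own commands for the same checks.

```python
from collections import Counter

# Hypothetical entered data; every value is kept as the text that was typed.
rows = [
    {"id": "1", "sex": "F",  "weight_lb": "157"},
    {"id": "2", "sex": "M",  "weight_lb": "517"},   # transposition error
    {"id": "3", "sex": "f",  "weight_lb": "1,62"},  # comma in a number
    {"id": "4", "sex": "M ", "weight_lb": ""},      # trailing blank, missing weight
    {"id": "5", "sex": "",   "weight_lb": "148"},   # blank category
]

def to_number(text):
    """Return float(text), or None if the text is blank or not a valid number."""
    try:
        return float(text)
    except ValueError:
        return None

# Search for blanks and commas.
blank_sex_ids = [r["id"] for r in rows if r["sex"].strip() == ""]
comma_weight_ids = [r["id"] for r in rows if "," in r["weight_lb"]]

# Sort the numeric values, then read off the smallest and largest.
weights = sorted(
    w for w in (to_number(r["weight_lb"]) for r in rows) if w is not None
)
smallest, largest = weights[0], weights[-1]

# Tabulate the categorical variable exactly as entered.
sex_counts = Counter(r["sex"] for r in rows)

print("Blank sex, ids:", blank_sex_ids)
print("Comma in weight, ids:", comma_weight_ids)
print("Sorted weights:", weights)
print("Smallest and largest:", smallest, largest)
for category, count in sex_counts.items():
    print(repr(category), count)   # repr() makes stray blanks visible
```

Printing each category with `repr()` is a deliberate choice: it puts quotation marks around the value, so `'M '` with a trailing blank shows up as a separate category from `'M'` instead of looking identical in the table.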